Abstract. Pattern recognition is a central topic in Learning Theory with numerous
applications such as voice and text recognition, image analysis and computer diagnosis. The
statistical set-up in classification is the following: we are given an i.i.d. training set
$(X_1, Y_1), \dots, (X_n, Y_n)$ where $X_i$ represents a feature and $Y_i \in \{0, 1\}$ is a label attached
to that feature. The underlying joint distribution of $(X, Y)$ is unknown, but we can learn
about it from the training set and we aim at devising low error classifiers $f : \mathcal{X} \to \mathcal{Y}$ used to
predict the label of new incoming features.

Here we solve a quantum analogue of this problem, namely the classification of two 
arbitrary unknown qubit states. Given a number of 'training' copies from each of the states, 
we would like to 'learn' about them by performing a measurement on the training set. The 
outcome is then used to design measurements for the classification of future systems with
unknown labels. We find the asymptotically optimal classification strategy and show that 
typically, it performs strictly better than a plug-in strategy based on state estimation. 

The figure of merit is the excess risk, which is the difference between the probability of
error and the probability of error of the optimal measurement when the states are known, that
is the Helstrom measurement. We show that the excess risk has rate $n^{-1}$ and compute the
exact constant of the rate.
1. Introduction

Statistical learning theory [1, 2, 3, 4] is a broad research field stretching over statistics and 
computer science, whose general goal is to devise algorithms which have the ability to learn 
from data. One of the central learning problems is how to recognise patterns [5], with practical 
applications in speech and text recognition, image analysis, computer-aided diagnosis and data
mining.

The paradigm of Quantum Information theory is that quantum systems carry a new type
of information with potentially revolutionary applications such as faster computation and
secure communication [6]. Motivated by these theoretical challenges, Quantum Engineering
is developing new tools to control and accurately measure individual quantum systems [7]. In
the process of engineering exotic quantum states, statistical validation has become a standard 
experimental procedure [8, 9] and Quantum Statistical Inference has passed from its purely 
theoretical status in the 70's [10, 11] to a more practically oriented theory at the interface 
between the classical and quantum worlds [12, 13, 14, 15]. 

In this paper we put forward a new type of quantum statistical problem inspired by learning 
theory, namely quantum state classification. Similar ideas have already appeared in the 
physics [16, 17, 18, 19] and learning [20, 21, 22] literature, but here we emphasise the close
connection with learning and we aim at going beyond the special models based on group
symmetry and pure states. However, we limit ourselves to a two-dimensional state, which
could be regarded as a toy model from the viewpoint of learning theory, but we hope that more
interesting applications will follow.

Before explaining what quantum classification is, let us briefly mention the classical set-up 
we aim at generalising. In supervised learning the goal is to learn to predict an output $y \in \mathcal{Y}$,
given the input (object) $x \in \mathcal{X}$, where input and output are assumed to be correlated and
have an unknown joint distribution $\mathbb{P}$ over $\mathcal{X} \times \mathcal{Y}$. To do this, we are first provided with a set
of $n$ previously observed inputs with known output variables (called training examples), i.e.
independent random pairs $(X_i, Y_i)$, $i = 1, \dots, n$ drawn from $\mathbb{P}$. Using the training set, we
construct a function $h_n : \mathcal{X} \to \mathcal{Y}$ to predict the output for future, yet unseen objects. When
$\mathcal{Y} = \{0, 1\}$, i.e. the output is a binary variable, this is called binary classification and is the
typical set-up in pattern recognition. The input space is usually considered to be a subset
of $p$-dimensional space $\mathbb{R}^p$, so that the object $x$ can be described by $p$ measurement values
often called features. This description is very general as it allows one to handle e.g. categorical
(non-numerical) values (encoded as integer numbers), images (e.g. the measured brightness of
each pixel corresponds to a separate feature), time series (features correspond to the values
of the signal at given times), etc.

In this paper, we consider the classification problem in which the objects to be classified 
are quantum states. Simply, we have a quantum system prepared in either of two unknown 
quantum states and we want to know which one it is. As in the classical case, this only makes 
sense if we are also provided with training examples from both states, with their respective 
labels, from which we can learn about the two alternatives. How could such a scenario occur? 
Suppose we send one bit of information through a noisy quantum channel which is not known. 
To decode the information (the input in this case) we need to be able to classify the output 
states corresponding to the two inputs. Alternatively, the binary variable may be related to a 
coupling of the channel which we want to detect. 

Needless to say, quantum systems are intrinsically statistical and can be 'learned' only by
repeated preparation, so that the problem is really the quantum extension of the classical 
classification problem. On the other hand this is related to the problem of state discrimination 
which in the case of two hypotheses, has an explicit solution known as the Helstrom 
measurement [11]. The point is that when the states are unknown, the Helstrom measurement
is itself unknown and has to be learned from the training set. An intuitive solution would be a
plug-in procedure: first estimate the two states, and then apply the Helstrom measurement 
corresponding to the estimates on any new to-be-classified state. This indeed gives a 
reasonable classification strategy, but as we will see, this is not the best one. The optimal 
strategy in the asymptotic framework is to directly estimate the Helstrom measurement 
without intermediate state estimation. The optimality is defined by the natural figure of
merit called excess risk, which is the difference between the expected error probability and
the error probability of the Helstrom measurement. We show that the excess risk converges
to zero with the size $n$ of the training set as $n^{-1}$, and that the ratio between the optimal and state
estimation plug-in risk is a constant factor.

Our analysis is valid for arbitrary mixed states and is performed in a pointwise, local minimax 
(rather than Bayesian) setting which captures the behaviour of the risk around any pair 
of states. The key theoretical tool is the recently developed theory of local asymptotic 
normality (LAN) for quantum states [23, 24, 25, 26] which is an extension of the classical 
concept in mathematical statistics introduced by Le Cam [27]. Roughly, LAN says that the 
collective state $\rho_\theta^{\otimes n}$ of $n$ i.i.d. quantum systems can be approximated by a simple Gaussian
state of some classical variables and quantum oscillators. This was used to derive optimal 
state estimation strategies for arbitrary mixed states of arbitrary finite dimension, and also 
in finding quantum teleportation benchmarks for multiple qubit states [28]. In this paper, 
LAN is used to identify the (asymptotically) optimal measurement on the training set as a
linear measurement on two harmonic oscillators. Similarly to the case of state estimation,
such collective measurements perform strictly better than the local ones [29, 30]. Moreover, the
optimal collective measurement for learning is different from the optimal measurement for state
estimation, showing once again that generically, different quantum decision problems cannot 
be solved optimally simultaneously. 

Related work. Sasaki and Carlini [16] defined a quantum matching machine which aims at 
pairing a given 'feature' state with the closest out of a set of 'template' states. The problem 
is formulated in a Bayesian framework with uniform priors over the feature and template 
pure states which are considered to be unknown. Bergou and Hillery [17] introduced a
discrimination machine, which corresponds to our set-up in the special case when the training
set is of size $n = 1$. The papers [18, 19] deal with the problem of quantum state identification
as defined in this paper. The special case of Bayesian risk with uniform priors over pure states 
was solved in [18], with the small difference that the learning and classification steps are done 
in a single measurement over n+1 systems. However, as in the case of state estimation [31], 
the proof relies on the special symmetry of the prior and does not cover mixed states. Finally, 
the concept of quantum classification was already proposed in a series of papers [20, 21, 22]. 
However, the authors mostly focused on problem formulation, reduction between different 
problem classes and general issues regarding learnability. Other related papers which fall 
outside the scope of our investigation are [32, 33]. 

This paper is organised as follows. Section 2 gives a short overview of the classical 
classification set-up and introduces its quantum analogue. Section 3 discusses the LAN theory 
with emphasis on the qubit case. In section 4 we reformulate the classification problem in the 
asymptotic (local) framework, as an estimation problem with quadratic loss for the training 
set. The main result is Theorem 5.1 of Section 5 which gives the minimax excess risk for
the case of known priors. The case of unknown priors is treated in Section 5.2. The optimal
classifier is compared to the plug-in procedure based on optimal state estimation in Section 
5.1. The geometry of the problem is captured by the Bloch ball illustrated in Figure 4. We 
conclude the paper with discussions. 

2. Classical and quantum learning 

2.1. Classical Learning 

Let $(X, Y)$ be a pair of random variables with joint distribution $\mathbb{P}$ over the measure space
$(\mathcal{X} \times \{0, 1\}, \Sigma)$. In the classical setting $\mathcal{X}$ is usually a subset of $\mathbb{R}^p$ and $Y$ is a binary variable.

In a first stage we are given a training set of $n$ i.i.d. pairs $\{(X_1, Y_1), \dots, (X_n, Y_n)\}$ with
distribution $\mathbb{P}$, from which we would like to 'learn' about $\mathbb{P}$. In the second stage we are
presented with a new sample $X$ and we are asked to guess its unseen label $Y$. For this we
construct a (random) classifier

$$ h_n : \mathcal{X} \to \{0, 1\} $$

which depends on the data $(X_1, Y_1), \dots, (X_n, Y_n)$. Its overall accuracy is measured in terms
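of the expected error rate according to the data distribution $\mathbb{P}$,

$$ \mathbb{P}_e(h_n) = \mathbb{P}(h_n(X) \neq Y) = \mathbb{E}\left[ \mathbb{1}_{h_n(X) \neq Y} \right], $$

where $\mathbb{1}_C$ is the indicator function equal to 1 if $C$ is true, and 0 otherwise. However, the
error rate itself does not give a good indication of the performance of the learning method.
Indeed, even an 'oracle' who knows $\mathbb{P}$ exactly typically has a non-zero error: in this case the
optimal $h$ is the Bayes classifier, which chooses the label that is more probable with respect to
the conditional distribution $\mathbb{P}(y|x)$,

$$ h^*(X) = \begin{cases} 0 & \text{if } \eta(X) \leq 1/2 \\ 1 & \text{if } \eta(X) > 1/2 \end{cases} \tag{1} $$

where $\eta(x) := \mathbb{P}(Y = 1|x)$. The Bayes risk is

$$ \mathbb{P}_e(h^*) = \mathbb{E}\left[ \mathbb{E}\left[ \mathbb{1}_{h^*(X) \neq Y} \,|\, X \right] \right] = \frac{1}{2} \left( 1 - \mathbb{E}\left[ |1 - 2\eta(X)| \right] \right). $$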

An alternative view of the Bayes classifier which fits more naturally in the quantum set-up 
is the following. We are given data $X$ whose probability distribution is either $\mathbb{P}_0(X) :=
\mathbb{P}(X|Y = 0)$ or $\mathbb{P}_1(X) := \mathbb{P}(X|Y = 1)$ and we would like to test between the two
hypotheses. We are in a Bayesian set-up where the hypotheses are chosen randomly with
prior probabilities $\pi_i = \mathbb{P}(Y = i)$. The optimal solution of this problem is the well known
likelihood ratio test: we choose the hypothesis with higher likelihood

$$ h^*(X) = \begin{cases} 0 & \text{if } \pi_0 p_0(X) \geq \pi_1 p_1(X) \\ 1 & \text{if } \pi_0 p_0(X) < \pi_1 p_1(X) \end{cases} $$

which can be easily verified to be identical to the previously defined Bayes classifier. The
Bayes risk can be written as

$$ \mathbb{P}_e^* = \frac{1}{2} \left( 1 - \| \pi_0 p_0 - \pi_1 p_1 \|_1 \right), \tag{2} $$

where $p_i$ are the densities of $\mathbb{P}(X|Y = i)$ with respect to some common reference measure.
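
As a small numerical illustration, which is not part of the original argument, the two expressions for the Bayes risk can be compared on a discrete feature space with three values. The priors and conditional probabilities below are arbitrary values assumed for the example; the densities are taken with respect to the counting measure.

```python
# Toy check that the two Bayes risk formulas agree on a discrete feature space.
# All numbers below are illustrative assumptions, not taken from the paper.

pi0, pi1 = 0.4, 0.6          # priors P(Y=0), P(Y=1)
p0 = [0.5, 0.3, 0.2]         # density of X given Y=0 (counting measure)
p1 = [0.1, 0.3, 0.6]         # density of X given Y=1

# Marginal of X and regression function eta(x) = P(Y=1 | X=x).
marginal = [pi0 * a + pi1 * b for a, b in zip(p0, p1)]
eta = [pi1 * b / m for b, m in zip(p1, marginal)]

# Bayes classifier: predict 1 exactly when eta(x) > 1/2.
bayes = [1 if e > 0.5 else 0 for e in eta]

# Risk written with the regression function: (1 - E|1 - 2 eta(X)|) / 2.
risk_eta = 0.5 * (1.0 - sum(m * abs(1.0 - 2.0 * e) for m, e in zip(marginal, eta)))

# Risk written with the L1 distance, as in (2).
l1_distance = sum(abs(pi0 * a - pi1 * b) for a, b in zip(p0, p1))
risk_l1 = 0.5 * (1.0 - l1_distance)

# Direct computation: at each x the Bayes classifier errs with the smaller weighted mass.
risk_direct = sum(min(pi0 * a, pi1 * b) for a, b in zip(p0, p1))

print(bayes, risk_eta, risk_l1, risk_direct)
```

The three risk values coincide because $|\pi_0 p_0 - \pi_1 p_1| = \pi_0 p_0 + \pi_1 p_1 - 2\min(\pi_0 p_0, \pi_1 p_1)$ pointwise.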

Returning to the classification set-up where $\mathbb{P}$ is unknown, we see that a more informative
performance measure for $h_n$ is the excess risk:

$$ R(h_n) = \mathbb{P}_e(h_n) - \mathbb{P}_e(h^*) \geq 0 \tag{3} $$

which measures how much worse the procedure $h_n$ performs compared to the performance
of the oracle classifier. In statistical learning theory one is primarily interested in consistent
classifiers, for which the excess risk converges to 0 as $n \to \infty$, and then in finding classifiers
with fast convergence rates [2, 3]. But how to compare different learning procedures? One 
can always design algorithms which work well for certain distributions and badly for others. 
Here we take the statistical approach and consider that all prior information about the data is 
encoded in the statistical model $\{\mathbb{P}_\theta : \theta \in \Theta\}$, i.e. the data comes from a distribution which
depends on some unknown parameter $\theta$ belonging to a parameter space $\Theta$. The latter may be a
subset of $\mathbb{R}^k$ (parametric) or a large class of distributions with certain 'smoothness' properties
(non-parametric). One can then define the maximum risk of $h_n$

$$ R_{\max}(h_n) := \sup_{\theta \in \Theta} R_\theta(h_n) $$

where $R_\theta$ denotes the excess risk when the underlying distribution is $\mathbb{P}_\theta$. A procedure $\tilde h_n$ is
called minimax if its maximum risk is no larger than that of any other procedure:

$$ R_{\max}(\tilde h_n) = \inf_{h_n} R_{\max}(h_n) = \inf_{h_n} \sup_{\theta \in \Theta} R_\theta(h_n). \tag{4} $$

Alternatively one can take a Bayesian approach and optimise the average risk with respect to
a given prior over $\Theta$.

Example 2.1. Let $(X, Y) \in \{0, 1\}^2$ with unknown parameters $\eta(0)$, $\eta(1)$ and $\mathbb{P}(X = 0)$,
satisfying $\eta(0) < 1/2$ and $\eta(1) > 1/2$. Then the Bayes classifier is $h^*(0) = 0$ and
$h^*(1) = 1$. On the other hand, from the training sample one can estimate $\eta(i)$ and obtain the
concentration result

$$ \mathbb{P}[\hat\eta_n(0) < 1/2 \text{ and } \hat\eta_n(1) > 1/2] = 1 - O(\exp(-cn)). $$

Thus the plug-in estimator $h_n$ obtained by replacing $\eta$ by $\hat\eta_n$ in (1) is equal to $h^*$ with high
probability and the excess risk is exponentially small.
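
The exponential concentration in this example can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's derivation: it fixes a hypothetical value $\eta(0) = 0.3$, treats the number $m$ of training points with $X = 0$ as given, and compares the exact probability that the empirical frequency reaches 1/2 with the Hoeffding bound $\exp(-2m(1/2 - \eta(0))^2)$.

```python
import math

def tail_probability(m, eta, threshold=0.5):
    """Exact P[eta_hat >= threshold], where eta_hat is the mean of m Bernoulli(eta) draws."""
    k_min = math.ceil(threshold * m)
    return sum(math.comb(m, k) * eta**k * (1.0 - eta) ** (m - k)
               for k in range(k_min, m + 1))

def hoeffding_bound(m, eta, threshold=0.5):
    """Hoeffding upper bound exp(-2 m (threshold - eta)^2), valid for eta < threshold."""
    return math.exp(-2.0 * m * (threshold - eta) ** 2)

eta0 = 0.3  # assumed value of eta(0), strictly below 1/2
for m in (10, 50, 100, 200):
    # Probability that the plug-in rule gets the label of X = 0 wrong, and its bound.
    print(m, tail_probability(m, eta0), hoeffding_bound(m, eta0))
```

Both columns decay exponentially in $m$; the closer $\eta(0)$ is to 1/2, the slower the decay, which is the point made in the remark following the example.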

The crucial feature leading to exponentially small risk was the fact that the regression function
$\eta(X)$ is bounded away from the critical value 1/2. This situation is rather special but shows
that the behaviour of the excess risk depends on the properties of $\eta$ around the value 1/2. Let
us look at another simple example with a different behaviour. 

Example 2.2. Let $(X, Y) \in \mathbb{R} \times \{0, 1\}$ with

$$ \mathbb{P}(X|Y = 0) = N(a, 1), \qquad \mathbb{P}(X|Y = 1) = N(b, 1) $$

for some unknown means $a < b$, and $\mathbb{P}(Y = 0) = 1/2$. From Figure 1 we can see that
$p_0(x) < p_1(x)$ if and only if $x > (a + b)/2$ so that the Bayes classifier is

$$ h^*(x) = \begin{cases} 0 & \text{if } x \leq (a + b)/2 \\ 1 & \text{if } x > (a + b)/2 \end{cases} $$

The Bayes risk is equal to the orange area under the two curves. Again a natural classifier
is obtained by estimating the midpoint $(a + b)/2$ and plugging into the above formula. The
additional error is the area of the green triangle. Since $(\hat a + \hat b)/2 - (a + b)/2 \approx 1/\sqrt{n}$, and both
the base and the height of the triangle are proportional to this deviation, one can deduce that

$$ R(h_n) = O(n^{-1}), $$

and it can be shown that this rate of convergence is optimal [34].
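
The $n^{-1}$ rate of this example can be illustrated numerically. The following is a sketch under simplifying assumptions that are not in the original text: exactly $n/2$ training points per class, so that the estimated midpoint has law $N((a+b)/2, 1/n)$, and illustrative means $a = 0$, $b = 2$. The expectation over the estimated midpoint is evaluated by a plain Riemann sum.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def error_probability(t, a, b):
    """Error of the rule 'predict 1 iff x > t' for N(a,1) versus N(b,1) with equal priors."""
    return 0.5 * (1.0 - STD_NORMAL.cdf(t - a)) + 0.5 * STD_NORMAL.cdf(t - b)

def expected_excess_risk(n, a, b, grid=4001, width=8.0):
    """Average excess risk of the plug-in threshold.

    Assumption: n/2 training points per class, so the estimated midpoint is
    distributed as N((a+b)/2, 1/n). The expectation is a Riemann sum over
    +/- `width` standard deviations of that law.
    """
    mid = 0.5 * (a + b)
    sd = 1.0 / math.sqrt(n)
    law = NormalDist(mid, sd)
    step = 2.0 * width * sd / (grid - 1)
    bayes = error_probability(mid, a, b)
    total = 0.0
    for i in range(grid):
        t = mid - width * sd + i * step
        total += (error_probability(t, a, b) - bayes) * law.pdf(t) * step
    return total

a, b = 0.0, 2.0  # illustrative means
for n in (100, 400, 1600):
    # The rescaled risk n * R(h_n) should stabilise as n grows.
    print(n, n * expected_excess_risk(n, a, b))
```

A second-order Taylor expansion of the error probability around the true midpoint suggests that the rescaled risk approaches $d\,\varphi(d)/2$, where $d = (b - a)/2$ and $\varphi$ is the standard normal density.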

From this example we see that the rate is determined by the behaviour of the regression
function $\eta$ around 1/2, namely in this case

$$ \mathbb{P}(|\eta(X) - 1/2| \leq t) = O(t), \qquad t > 0 $$

which is called the margin condition. Roughly speaking, in a parametric model satisfying the
margin condition, the excess risk goes to zero as $O(n^{-1})$. In non-parametric models (which
are the main focus of learning theory), arbitrarily slow rates are possible depending on the 
complexity of the model and the behaviour of the regression function [34]. 

According to Vapnik [3], one of the principles of statistical learning is: "when solving a
problem of interest, do not solve a more general problem as an intermediate step." This is
interpreted as saying that learning procedures which first estimate the statistical model (or
regression function) and then plug this estimate into the Bayes classifier are less efficient
than methods which aim at constructing $h(x)$ directly. Recently it has been shown [34] that
this is not necessarily the case if some type of margin condition is assumed, and that plug-in
estimators

$$ h_{\text{PLUG-IN}}(x) = \mathbb{1}_{\hat\eta(x) > 1/2} \tag{5} $$

can perform close to, or at, 'fast $n^{-1}$ rates'. In this paper we show that, at least in what concerns
the constant in front of the rate, direct quantum learning performs better than plug-in methods
based on optimal state estimation. This is a purely quantum phenomenon which stems from
the incompatibility between the optimal measurements for estimation and learning.

Figure 1. Likelihood functions for two normal distributions with means $a$, $b$. The Bayes risk
is the area of the orange triangle. The excess risk is the area of the green triangle.



2.2. Quantum Learning 

We now consider the quantum counterpart of the learning problem, the classification of 
quantum states. In this case, $\mathcal{X}$ is replaced by a Hilbert space of dimension $d$. To find the
counterpart of $\mathbb{P}$ we write $\mathbb{P}(dx, y) = \mathbb{P}(dx|y)\mathbb{P}(y)$ and replace the conditional distributions
$\mathbb{P}(dx|y = 0)$ and $\mathbb{P}(dx|y = 1)$ by density matrices $\rho$ and $\sigma$, while $\mathbb{P}(y)$ describes prior
probabilities over the states, usually denoted by $\pi_y := \mathbb{P}(Y = y)$. There is no direct
counterpart of the object $x$, since the quantum state is identified with its description in terms
of a density matrix; however, one can think of $x$ as a set of values obtained by measuring the
state $\rho$.

The training set consists of $n$ i.i.d. pairs $\{(\tau_1, Y_1), \dots, (\tau_n, Y_n)\}$, where $\tau_i = \rho$ if $Y_i = 0$ and
$\tau_i = \sigma$ if $Y_i = 1$. Thus we are randomly given copies of $\rho$ and $\sigma$ together with their labels,
but we do not know what $\rho$ and $\sigma$ are. After a permutation the joint state of the training set can
be concisely written as $\rho^{\otimes n_0} \otimes \sigma^{\otimes n_1}$, where $n_y$ is the number of copies for which $Y_i = y$.

The experimenter is allowed to make any physical operations on the training set (such as
unitary evolution or measurements) and outputs a binary-valued measurement with POVM
elements $M_n := (P_n, \mathbb{1} - P_n)$. This (random) POVM plays the role of the classical classifier
$h_n$: given a new copy of the quantum state whose label is unknown, we apply the measurement
$M_n$ to guess whether the state is $\rho$ or $\sigma$. The accuracy is measured in terms of the expected
misclassification error:

$$ \mathbb{P}_e(M_n) = \mathbb{E}\left[ \pi_0 \mathrm{Tr}[\rho(\mathbb{1} - P_n)] + \pi_1 \mathrm{Tr}[\sigma P_n] \right] $$

where the expectation is taken over the outcomes $P_n$.

The Bayes classifier $M^*$ is nothing but the Helstrom measurement [11] which optimally
discriminates between known states $\rho$, $\sigma$ with priors $\pi_0$, $\pi_1$. In this case $M^* = (P^*, \mathbb{1} - P^*)$
where $P^*$ is the projection onto the subspace of positive eigenvalues of the operator $\pi_0\rho - \pi_1\sigma$,
i.e. $P^* = [\pi_0\rho - \pi_1\sigma]_+$. Note that if both eigenvalues are of the same sign, the optimal
procedure is to choose the state with higher $\pi_i$ without making any measurement at all. The
Helstrom risk can be expressed as $\mathbb{P}_e^* = \frac{1}{2}\left(1 - \mathrm{Tr}[|\pi_1\sigma - \pi_0\rho|]\right)$, which is the quantum
extension of (2).

As before, the performance of an arbitrary classifier $M_n$ is measured by the excess risk:

$$ R(M_n) = \mathbb{P}_e(M_n) - \mathbb{P}_e^* = \mathbb{E}\,\mathrm{Tr}\left[ (\pi_1\sigma - \pi_0\rho)(P_n - P^*) \right] \tag{6} $$

which is expected to vanish asymptotically with $n$.

Table 1. Comparison of classical and quantum learning.

element             classical learning                                                  quantum learning
distribution        $\mathbb{P}$                                                        $(\rho, \sigma)$ with priors $(\pi_0, \pi_1)$
training example    $(x, y)$                                                            $(\rho, 0)$ or $(\sigma, 1)$
training set        $\{(x_1, y_1), \dots, (x_n, y_n)\}$                                 $\rho^{\otimes n_0} \otimes \sigma^{\otimes n_1}$
classifier          function $h$                                                        POVM $M = (P, \mathbb{1} - P)$
optimal classifier  function $h^*(x) = \mathbb{1}_{\eta(x) > 1/2}$                      Helstrom measurement $P^* = [\pi_0\rho - \pi_1\sigma]_+$
minimum risk        $\frac{1}{2}(1 - \mathbb{E}[|1 - 2\eta(X)|])$                       $\frac{1}{2}(1 - \mathrm{Tr}[|\pi_1\sigma - \pi_0\rho|])$
risk                $\mathbb{P}(h(X) \neq Y)$                                           $\pi_0 \mathrm{Tr}[\rho(\mathbb{1} - P)] + \pi_1 \mathrm{Tr}[\sigma P]$

In Table 1 we summarise the analogous concepts in the classical and the quantum learning
set-up. Besides these obvious correspondences we would like to point out some interesting
differences. Based on the coin toss Example 2.1 one may expect that the classification of two
qubit states should exhibit similar exponentially fast rates. In fact, as we will show in this
paper, the rate is $n^{-1}$ as in Example 2.2, where the data is not discrete but continuous and the
regression function is not bounded away from 1/2. A possible explanation is the fact that
in the quantum case the 'data' to be labelled is a quantum system and the distribution of the
outcome depends on the measurement. A helpful way to think about it is illustrated in Figure 2.
The unknown label is the input of a black box which outputs the data $X$ with conditional
distribution $\mathbb{P}(X|Y)$. In the quantum case the box has an additional input, the measurement
choice, which appears as a parameter in the conditional distribution and is controlled by the
experimenter. The game is to learn from the training set the optimal value of this parameter,
for which the identification of the label $Y$ is easiest. This set-up resembles that of active
learning [35] where the training data $X_i$ are actively chosen rather than collected randomly.

Figure 2. Quantum learning seen as classical learning with data distribution depending on an
additional parameter controlled by the experimenter.

2.3. Local minimax formulation of optimality

We now give the precise formulation of what we mean by asymptotic optimality of a learning
strategy $\{M_n : n \in \mathbb{N}\}$. As in the classical case we construct a model which contains all
unknown parameters of the problem: the two states $\rho$, $\sigma$ and the prior $\pi_0$. We denote these
parameters collectively by $\theta$, which belongs to a parameter space $\Theta \subset \mathbb{R}^k$. When some prior
information is available about the model, it can be included by restricting to a sub-model of
the general one. As in the classical case we denote by $R_\theta(M_n)$ the risk of $M_n$ at $\theta$, and
we can define the maximum risk as in (4). However, assuming for the moment that the
optimal rate of classification is $n^{-1}$, we use a more refined performance measure which is the
local version of the maximum risk $R_{\max}$ around a fixed parameter $\theta_0$

$$ R^{(l)}_{\max}(M_n ; \theta_0) := \sup_{\|\theta - \theta_0\| \leq n^{-1/2+\epsilon}} n R_\theta(M_n) \tag{7} $$

where $\epsilon > 0$ is a small number. Note that in the above definition the usual risk was multiplied
by the inverse of its rate, $n$, so that we can expect $R^{(l)}_{\max}$ to have a non-trivial limit when
$n \to \infty$. The reason for choosing the local maximum risk is that it reflects better the difficulty
of the problem in different regions of the parameter space, while the maximum risk captures
the worst possible behaviour over the whole parameter space. We can think of the local ball
$\|\theta - \theta_0\| \leq n^{-1/2+\epsilon}$ as the intrinsic parameter space when the training set consists of $n$
samples. Indeed a simple estimator on a small proportion $\tilde n = n^{1-\epsilon}$ of the sample locates
the true parameter in such a ball with high probability (see Lemma 2.1 in [24]).

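Definition 2.1. The local minimax risk at $\theta_0$ is defined as

$$ R^{(l)}_{\mathrm{minmax}}(\theta_0) := \limsup_{n \to \infty} \inf_{M_n} R^{(l)}_{\max}(M_n ; \theta_0). $$

A sequence of classifiers $\{M_n : n \in \mathbb{N}\}$ is called locally asymptotic minimax if

$$ \limsup_{n \to \infty} R^{(l)}_{\max}(M_n ; \theta_0) = R^{(l)}_{\mathrm{minmax}}(\theta_0). $$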

We identify two general learning strategies. The first one consists in estimating the states $\rho$, $\sigma$
and prior $\pi_0$ (optimally) to get $\hat\rho$, $\hat\sigma$, $\hat\pi_0$ and then constructing the classifier (measurement) as:

$$ P_{\text{PLUG-IN}} = [\hat\pi_0 \hat\rho - \hat\pi_1 \hat\sigma]_+ . \tag{8} $$

The second strategy aims at estimating the Helstrom projection P* directly from the training 
set without passing through state estimation. As we will see, it turns out that in general the 
latter performs better than the former. 
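
For qubits the plug-in construction (8) and its excess risk (6) can be written in terms of Bloch vectors. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the estimates are supplied by hand rather than produced by any estimation procedure from the paper, and the states are parametrised as $\rho = \frac{1}{2}(\mathbb{1} + \vec r \vec\sigma)$, $\sigma = \frac{1}{2}(\mathbb{1} + \vec s \vec\sigma)$, so that $\pi_0\rho - \pi_1\sigma = \frac{1}{2}((\pi_0 - \pi_1)\mathbb{1} + \vec v \vec\sigma)$ with $\vec v = \pi_0 \vec r - \pi_1 \vec s$.

```python
import math

def dot(u, v):
    """Euclidean inner product of two 3-vectors."""
    return sum(x * y for x, y in zip(u, v))

def weighted_difference(r, s, pi0):
    """Vector v such that pi0*rho - pi1*sigma = ((pi0 - pi1) 1 + v.sigma) / 2."""
    pi1 = 1.0 - pi0
    return [pi0 * x - pi1 * y for x, y in zip(r, s)]

def helstrom_risk(r, s, pi0):
    """(1 - Tr|pi0*rho - pi1*sigma|) / 2 for qubits with Bloch vectors r and s."""
    v = weighted_difference(r, s, pi0)
    norm_v = math.sqrt(dot(v, v))
    bias = 2.0 * pi0 - 1.0  # pi0 - pi1
    eigenvalues = (0.5 * (bias + norm_v), 0.5 * (bias - norm_v))
    return 0.5 * (1.0 - sum(abs(lam) for lam in eigenvalues))

def plug_in_direction(r_hat, s_hat, pi0_hat):
    """Unit vector m with (1 + m.sigma)/2 = [pi0_hat*rho_hat - pi1_hat*sigma_hat]_+.

    Returns None when both eigenvalues have the same sign: no measurement is
    needed and one simply guesses the label with the larger estimated prior.
    """
    v = weighted_difference(r_hat, s_hat, pi0_hat)
    norm_v = math.sqrt(dot(v, v))
    if norm_v <= abs(2.0 * pi0_hat - 1.0):
        return None
    return [x / norm_v for x in v]

def error_probability(m, r, s, pi0):
    """pi0*Tr[rho(1-P)] + pi1*Tr[sigma P] for the projection P = (1 + m.sigma)/2."""
    pi1 = 1.0 - pi0
    if m is None:  # trivial rule; assumes the estimated prior ordering is correct
        return min(pi0, pi1)
    return pi0 * 0.5 * (1.0 - dot(r, m)) + pi1 * 0.5 * (1.0 + dot(s, m))

def excess_risk(m, r, s, pi0):
    """Excess risk (6) of a fixed measurement direction m at the true parameters."""
    return error_probability(m, r, s, pi0) - helstrom_risk(r, s, pi0)

# True states (illustrative values) and a slightly wrong, hand-picked estimate.
r_true, s_true, pi0_true = (0.0, 0.0, 0.8), (0.6, 0.0, 0.0), 0.5
m_hat = plug_in_direction((0.05, 0.0, 0.75), (0.6, 0.05, 0.0), 0.5)
print(helstrom_risk(r_true, s_true, pi0_true), excess_risk(m_hat, r_true, s_true, pi0_true))
```

In this parametrisation the error probability of a rank-one projection with direction $\vec m$ equals $\frac{1}{2}(1 - \vec v \vec m)$, so the excess risk depends only on the angle between $\vec m$ and $\vec v$ and is quadratic in small deviations.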

In section 3 we review the concept of local asymptotic normality which means that locally, 
the training set can be efficiently approximated by a simple Gaussian model consisting of 
displaced thermal equilibrium states and classical Gaussian random variables. In section 4 we
show how to reduce the local classification risk for qubits to an expectation of a quadratic form
in the local parameters. This will simplify the problem of finding the optimal measurement of 
the training set, to that of finding the optimal measurement of a Gaussian state for a quadratic 
loss function [10]. 

3. Local asymptotic normality 

In a series of papers [23, 24, 25], Guţă and Kahn, and Guţă and Jenčová [26], developed a
new approach to state estimation based on the extension of the classical statistical concept of 
local asymptotic normality [27]. Using this tool one can cast the problem of (asymptotically) 
optimal state estimation into a much simpler one of estimating the mean of a Gaussian state 
with known variance. 

Local asymptotic normality provides a convenient description of quantum statistical models 
involving i.i.d. quantum states which can also be applied to the present learning problem. In 
this section we will give a brief introduction to this subject in as much as it is necessary for 
this paper and we refer to [25] for proofs and a more in depth analysis. 

3.1. Local asymptotic normality in classical statistics 

A typical statistical problem is the estimation of some unknown parameter $\theta$ from a sample
$X_1, \dots, X_n \in \mathcal{X}$ of independent, identically distributed random variables drawn from a
distribution $\mathbb{P}_\theta$ over a measure space $(\mathcal{X}, \Sigma)$. If $\theta$ belongs to an open subset of $\mathbb{R}^k$ for some
finite dimension $k$ and if the map $\theta \mapsto \mathbb{P}_\theta$ is sufficiently smooth, then widely used estimators
$\hat\theta_n(X_1, \dots, X_n)$ such as the maximum likelihood are asymptotically optimal in the sense that
they converge to $\theta$ at a rate $n^{-1/2}$ and the error has an asymptotically normal distribution

$$ \sqrt{n}(\hat\theta_n - \theta) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta)) \tag{9} $$

where $I(\theta)$ is the Fisher information matrix and the right side is the lower bound set by the
Cramér-Rao inequality for unbiased estimators. To give a simple example, if $X_i \in \{0, 1\}$ is the
result of a coin toss with $\mathbb{P}[X_i = 1] = \theta$ and $\mathbb{P}[X_i = 0] = 1 - \theta$ then the sufficient statistic

$$ \hat\theta_n = \frac{1}{n} \sum_{i=1}^{n} X_i $$

satisfies (9) by the Central Limit Theorem (CLT).

Naturally, the first inquiries into quantum statistics concentrated on generalising the
Cramér-Rao inequality to unbiased measurements, and on finding asymptotically optimal estimators
which achieve the quantum version of the Fisher information matrix [11, 10, 36]. However it
was found that due to the additional uncertainty introduced by the non-commutative nature of
quantum mechanics the situation is essentially different from the classical case. A summary
of these findings is:

(i) the multi-dimensional version of the Cramér-Rao bound is in general not achievable;

(ii) the optimal measurement depends on the loss function, i.e. the quadratic form
$(\hat\theta - \theta)^T G (\hat\theta - \theta)$, and different weight matrices $G$ lead in general to incompatible
measurements.
As we will see, these issues can be overcome by adopting a more modern perspective
on asymptotic statistics provided by the technique of local asymptotic normality [27, 37].
Instead of analysing particular estimation problems, the idea is to consider the structure of
the statistical model underlying the data and to approximate it by a simpler model for which
the statistical problems are easy to solve. In order to obtain a non-trivial limit model it makes
sense to rescale the parameters according to their uncertainty, so we assume that $\theta$ is localised
in a region of size $n^{-1/2}$ and we can write $\theta = \theta_0 + h/\sqrt{n}$ with $\theta_0$ known and $h \in \mathbb{R}^k$
the local parameter to be estimated. Such an assumption does not restrict the generality of
the problem since one can use an adaptive two-step procedure where a rough estimate $\theta_0$ is
obtained in the first step using a small part of the sample, and the rest is used for the accurate
estimation of the local parameter $h$.

Local asymptotic normality means that the sequence of (local) statistical models

$$ \mathcal{P}_n := \{ \mathbb{P}^n_{\theta_0 + h/\sqrt{n}} : \|h\| \leq c \}, \qquad n \in \mathbb{N} \tag{10} $$

depending 'smoothly' on $h$, converges to the Gaussian shift model

$$ \mathcal{G} := \{ N(h, I^{-1}(\theta_0)) : \|h\| \leq c \} \tag{11} $$

where we observe a single Gaussian variable with mean $h$ and fixed and known variance $I^{-1}(\theta_0)$,
the inverse of the Fisher information matrix at $\theta_0$. The
convergence has a precise mathematical definition in terms of the Le Cam distance between 
two statistical models which quantifies the extent to which each model can be 'simulated' by 
randomising data from the other. 

Definition 3.1. A positive linear map

$$ T : L^1(\mathcal{X}, \mathcal{A}, \mathbb{P}) \to L^1(\mathcal{Y}, \mathcal{B}, \mathbb{Q}) $$

is called a stochastic operator (or randomisation) if $\|T(p)\|_1 = \|p\|_1$ for every $p \in L^1_+(\mathcal{X})$.

For simplicity we consider only dominated models for which all distributions have densities
with respect to some fixed reference distribution. In this case a randomisation is the classical 
analogue of a quantum channel. 

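Definition 3.2. Let $\mathcal{P} := \{\mathbb{P}_\theta : \theta \in \Theta\}$ and $\mathcal{Q} := \{\mathbb{Q}_\theta : \theta \in \Theta\}$ be two dominated statistical
models with distributions having probability densities $p_\theta := d\mathbb{P}_\theta/d\mathbb{P}$ and $q_\theta := d\mathbb{Q}_\theta/d\mathbb{Q}$.
The deficiencies $\delta(\mathcal{P}, \mathcal{Q})$ and $\delta(\mathcal{Q}, \mathcal{P})$ are defined as

$$ \delta(\mathcal{P}, \mathcal{Q}) := \inf_T \sup_{\theta \in \Theta} \|T(p_\theta) - q_\theta\|_1 $$

$$ \delta(\mathcal{Q}, \mathcal{P}) := \inf_S \sup_{\theta \in \Theta} \|S(q_\theta) - p_\theta\|_1 $$

where the infimum is taken over all randomisations $T$, $S$. The Le Cam distance between $\mathcal{P}$
and $\mathcal{Q}$ is

$$ \Delta(\mathcal{P}, \mathcal{Q}) := \max(\delta(\mathcal{Q}, \mathcal{P}), \delta(\mathcal{P}, \mathcal{Q})). $$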

With these definitions the local asymptotic normality for i.i.d. parametric models can be
formulated as

Theorem 3.3. The sequence of local models (10) converges in the Le Cam distance to the
Gaussian shift model (11):

$$ \lim_{n \to \infty} \Delta(\mathcal{P}_n, \mathcal{G}) = 0. $$

This statement can be extended to slowly increasing local neighbourhoods $\|h\| \leq n^{\epsilon}$ with
precise convergence rate for the Le Cam distance.
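
The Gaussian shift approximation can be made concrete in the coin toss model, where the Fisher information is $I(\theta) = 1/(\theta(1 - \theta))$. The sketch below is only a proxy for the theorem: instead of the Le Cam distance, which involves an infimum over randomisations, it computes the Kolmogorov distance between the exact law of the sufficient statistic $\sqrt{n}(\hat\theta_n - \theta_0)$ under $\theta_0 + h/\sqrt{n}$ and the limit $N(h, \theta_0(1 - \theta_0))$. The values of $\theta_0$ and $h$ are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def binomial_pmf(n, k, p):
    """Binomial pmf computed in log space to avoid underflow for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def kolmogorov_distance(n, theta0, h):
    """sup_x |F_n(x) - G(x)| for the rescaled sample mean of n coin tosses.

    F_n is the law of sqrt(n) * (mean - theta0) under theta0 + h/sqrt(n);
    G = N(h, theta0 * (1 - theta0)) is the corresponding Gaussian shift limit.
    """
    theta = theta0 + h / math.sqrt(n)
    limit = NormalDist(h, math.sqrt(theta0 * (1.0 - theta0)))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        x = math.sqrt(n) * (k / n - theta0)
        g = limit.cdf(x)
        worst = max(worst, abs(cdf - g))  # left limit of the step function at x
        cdf += binomial_pmf(n, k, theta)
        worst = max(worst, abs(cdf - g))  # value of the step function at x
    return worst

theta0, h = 0.3, 1.0  # illustrative local model around theta0
for n in (100, 400, 1600):
    print(n, kolmogorov_distance(n, theta0, h))
```

The distance shrinks as $n$ grows, in line with the convergence of the local models to the Gaussian shift model, although a small Kolmogorov distance for one statistic does not by itself establish convergence in the Le Cam sense.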
3.2. Local asymptotic normality in quantum statistics

We will now describe the quantum version of local asymptotic normality for the simplest case 
of a family of spin states. The general result valid for arbitrary finite dimensional systems can 
be found in [25]. 

We are given $n$ spins independently and identically prepared in the state

$$ \rho_{\vec r} = \frac{1}{2} (\mathbb{1} + \vec r \vec\sigma) $$

where $\vec r$ is the unknown Bloch vector of the state and $\vec\sigma = (\sigma_x, \sigma_y, \sigma_z)$ are the Pauli matrices
in $M(\mathbb{C}^2)$. Following the methodology of the previous section, we concentrate on the
structure of the statistical model itself rather than optimal state estimation. The latter, and 
other statistical problems can be solved easily once the convergence to a Gaussian model is 
established. 

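By measuring a small proportion $n^{1-\epsilon} \ll n$ of the systems we can devise an initial rough
estimator $\rho_0 := \rho_{\vec r_0}$ so that with high probability the state is in a ball of size $n^{-1/2+\epsilon}$ around
$\rho_0$ [23]. We label the states in this ball by the local parameter $\vec u$

$$ \rho_{\vec u/\sqrt{n}} = \frac{1}{2} \left( \mathbb{1} + (\vec r_0 + \vec u/\sqrt{n}) \vec\sigma \right) $$

and define the local statistical model by

$$ \mathcal{Q}_n := \{ \rho^n_{\vec u} : \|\vec u\| \leq n^{\epsilon} \}, \qquad \rho^n_{\vec u} := \rho_{\vec u/\sqrt{n}}^{\otimes n}. \tag{12} $$

By choosing a coordinate system $(\vec a_1, \vec a_2, \vec a_3)$ with $\vec a_3$ along $\vec r_0$ and writing $\vec u = u_1 \vec a_1 +
u_2 \vec a_2 + u_3 \vec a_3$ we observe that $\rho_{\vec u/\sqrt{n}}$ is essentially obtained by perturbing the eigenvalues of
$\rho_0$ by $u_3/(2\sqrt{n})$ and rotating it with a 'small' unitary

$$ U := \exp\left( i(-u_2 \vec a_1 + u_1 \vec a_2) \vec\sigma / (2 r_0 \sqrt{n}) \right), \qquad r_0 := \|\vec r_0\|. $$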

The splitting into 'classical' and 'quantum' parameters $u_3$ and $(u_1, u_2)$ can be intuitively
explained through the 'big Bloch sphere' picture commonly used to describe spin coherent
[38] and spin squeezed states [39]. Let

$$ L_i := \sum_{k=1}^{n} \vec a_i \vec\sigma^{(k)}, \qquad i = 1, 2, 3 $$

be the collective spin components along the directions $\vec a_i$. By the Central Limit Theorem, the
distributions of $L_i$ with respect to $\rho_0^{\otimes n}$ converge as

$$ \frac{1}{\sqrt{n}} (L_3 - n r_0) \xrightarrow{\mathcal{D}} N(0, 1 - r_0^2), \qquad \frac{1}{\sqrt{n}} L_{1,2} \xrightarrow{\mathcal{D}} N(0, 1), $$

so that the joint spin state can be pictured as a vector of length $n r_0$ whose tip has a Gaussian
blob of size $\sqrt{n}$ representing the uncertainty in the collective variables (see Figure 3).
Furthermore, by a law of large numbers heuristic we estimate the commutators

$$ \left[ \frac{L_1}{\sqrt{n}}, \frac{L_2}{\sqrt{n}} \right] = 2i \frac{L_3}{n} \approx 2 i r_0 \mathbb{1}, \qquad \left[ \frac{L_{1,2}}{\sqrt{n}}, \frac{L_3}{\sqrt{n}} \right] \approx 0. $$
directions and $\approx n(1 - r_0^2)$ in the $z$ direction.



This suggests that $L_1/\sqrt{2r_0n}$ and $L_2/\sqrt{2r_0n}$ converge to the canonical coordinates $Q$ and $P$
of a quantum harmonic oscillator in a thermal equilibrium state

$$\Phi := (1 - p)\sum_{k=0}^{\infty} p^k\,|k\rangle\langle k|, \qquad p = \frac{1 - r_0}{1 + r_0},$$

where $\{|k\rangle : k \ge 0\}$ represents the Fock basis. Moreover the (rescaled) component
$\frac{1}{\sqrt n}(L_3 - nr_0)$ converges to a classical Gaussian variable $X \sim N := N(0, 1 - r_0^2)$ which
is independent of the quantum state. Note that the Gaussian limit state has both quantum
and classical components and should be identified with the state $\Phi \otimes N$ on the von Neumann
algebra $B(\ell^2(\mathbb{N})) \otimes L^\infty(\mathbb{R})$.
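The moments used in this heuristic can be checked numerically for a single spin. The sketch below is only an illustration and not part of the original argument: the helper names (`bloch_state`, `moments`, `thermal_variance`), the value $r_0 = 0.6$ and the Fock-space cutoff are assumptions made for the example, and the frame $(\vec a_1, \vec a_2, \vec a_3)$ is taken to be $(x, y, z)$.

```python
import numpy as np

# Pauli matrices; the frame (a1, a2, a3) is taken as (x, y, z) for illustration.
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)


def bloch_state(r):
    """Qubit density matrix (1 + r.sigma)/2 for a Bloch vector r."""
    x, y, z = r
    return 0.5 * (np.eye(2) + x * SX + y * SY + z * SZ)


def moments(rho, A):
    """Mean and variance of the observable A in the state rho."""
    mean = np.trace(rho @ A).real
    var = np.trace(rho @ A @ A).real - mean ** 2
    return mean, var


def thermal_variance(r0, kmax=2000):
    """Variance of Q in the thermal state with p = (1 - r0)/(1 + r0).

    Uses <k|Q^2|k> = k + 1/2 for canonical coordinates with [Q, P] = i.
    The Fock sum is truncated at kmax (an arbitrary cutoff).
    """
    p = (1 - r0) / (1 + r0)
    k = np.arange(kmax)
    weights = (1 - p) * p ** k
    return float(np.sum(weights * (k + 0.5)))


r0 = 0.6  # hypothetical length of the Bloch vector
rho0 = bloch_state([0.0, 0.0, r0])
mean3, var3 = moments(rho0, SZ)  # per-spin moments behind the limit of L_3
mean1, var1 = moments(rho0, SX)  # per-spin moments behind the limit of L_1
comm = np.trace(rho0 @ (SX @ SY - SY @ SX))  # expectation of [sigma_1, sigma_2]
var_q = thermal_variance(r0)
```

Under these assumptions the per-spin variances are $1 - r_0^2$ and $1$, the commutator has expectation $2ir_0$, and the thermal state has $\mathrm{Var}(Q) = 1/(2r_0)$, which is the variance of $L_1/\sqrt{2r_0n}$.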

What is the Gaussian state when the spins are in the 'perturbed' state $\rho^n_{\vec u}$? By applying
the same argument we obtain that the variables $Q, P, X$ pick up expectations which (in the
first order in $n^{-1/2}$) are proportional to the local parameters $(u_1, u_2, u_3)$ while the variances
remain unchanged. More precisely the oscillator is in a displaced thermal equilibrium state
$\Phi_{\vec u} := D(\vec u)\Phi D(\vec u)^*$, where $D(\vec u)$ is the displacement operator

$$D(\vec u) := \exp\left(i(-u_2Q + u_1P)/\sqrt{2r_0}\right),$$

and the classical part has distribution $N_{\vec u} := N(u_3, 1 - r_0^2)$.

Definition 3.4. The quantum Gaussian shift model $\mathcal G$ is defined by the family of quantum-classical states

$$\mathcal G := \{\Phi_{\vec u} \otimes N_{\vec u} : \vec u \in \mathbb{R}^3\} \qquad (13)$$

on $B(\ell^2(\mathbb{N})) \otimes L^\infty(\mathbb{R})$.

Having defined the sequence of local models $\mathcal Q_n$ and the Gaussian shift model, we need to
define the quantum counterparts of randomisations and convergence of models. The natural
analogue of a classical randomisation is a quantum channel, i.e. a completely positive, trace
preserving map $C : \mathcal T_1(\mathcal H) \to \mathcal T_1(\mathcal K)$, where $\mathcal T_1(\mathcal H)$ represents the trace class operators on
$\mathcal H$. However, as we saw above, a sequence of quantum statistical models may converge
to a quantum-classical one. The mathematical framework covering randomisations of both
classical and quantum statistical models is that of von Neumann algebras and channels between
their preduals. In finite dimensions this simply means that we deal with channels between
block diagonal matrix algebras. We can now define the Le Cam distance between two
quantum models in the same way as in Definition 3.2, with classical randomisations replaced by
quantum ones and $\|\cdot\|_1$ representing the norm on the predual, which is the trace norm in
the case of density matrices.

Theorem 3.5. Let $\mathcal Q_n$ be the sequence of statistical models (12) for $n$ i.i.d. local spin states,
and let $\mathcal G_n$ be the restriction of the Gaussian shift model (13) to the range of parameters
$\|\vec u\| \le n^\epsilon$. Then

$$\lim_{n\to\infty} \Delta(\mathcal Q_n, \mathcal G_n) = 0,$$
i.e. there exist sequences of channels $T_n$ and $S_n$ such that

$$\lim_{n\to\infty}\ \sup_{\|\vec u\| \le n^\epsilon} \|\Phi_{\vec u} \otimes N_{\vec u} - T_n(\rho^n_{\vec u})\|_1 = 0,$$
$$\lim_{n\to\infty}\ \sup_{\|\vec u\| \le n^\epsilon} \|\rho^n_{\vec u} - S_n(\Phi_{\vec u} \otimes N_{\vec u})\|_1 = 0. \qquad (14)$$

To conclude this section we would like to make a few comments on the significance of the
above result. The first point is that although it was intuitively illustrated using the Central
Limit Theorem, the concept of local asymptotic normality provides a stronger characterisation
of the 'Gaussian approximation'. Indeed the convergence in Theorem 3.5 is strong (in $L_1$)
rather than weak (in distribution), it is uniform over a range of local parameters rather than at
a single point, and has an operational meaning based on quantum channels.

Secondly, one can exploit these features to devise asymptotically optimal measurement
strategies for state estimation and prove that the Holevo bound [10] is asymptotically
attainable [40].

Thirdly, the result can be applied to other quantum statistical problems involving i.i.d. qubit
states such as cloning, teleportation benchmarks and quantum learning, and can serve as a
mathematical framework for analysing quantum state transfer protocols.

4. Local formulation of the classification problem

In this section we reformulate the problem of quantum state classification in the 'local' set-up.
This allows us to replace, on the one hand, the excess error probability by a quadratic form in
local parameters, and on the other hand, the training set consisting of i.i.d. spins by a simpler
Gaussian shift model.

Throughout the section we restrict to the case where the priors $\pi_0, \pi_1$ are known. In Section
5.2 we show that the results for known priors can easily be extended to unknown ones by
simply estimating them from the counts of $\rho$ and $\sigma$ states in the training sample.
4.1. The loss function

Recall that the classification problem is to discriminate between two unknown states $\rho$ and $\sigma$
by learning from a training set of $n$ labelled systems prepared randomly in one of the states
with probabilities $\pi_0$ and $\pi_1$. For this we measure the training set and produce an outcome
which is itself a measurement $M_n := (P_n, 1 - P_n)$ on $\mathbb{C}^2$. The accuracy of the procedure is
measured by the excess risk (6):

$$R(M_n) = \mathbb{E}\,\mathrm{Tr}\left[(\pi_1\sigma - \pi_0\rho)(P_n - P^*)\right] \qquad (15)$$

with $P^* = [\pi_0\rho - \pi_1\sigma]_+$. Since any binary measurement is a mixture of projective POVMs
[41], we can assume without loss of generality that $P_n$ is a projection and pull back the
randomness into the definition of the training set measurement.

As explained in Section 3.2, the a priori unknown states $\rho$ and $\sigma$ can be localised with high
probability in $n^{-1/2+\epsilon}$ neighbourhoods of $\rho_0$ and $\sigma_0$ by sacrificing a small proportion of the
training set systems; this means that $\rho_0$ and $\sigma_0$ are known and can be used by the classification
procedure. Let $\vec r_0$ and $\vec s_0$ be the Bloch vectors of $\rho_0$ and $\sigma_0$ and let us parametrise their
neighbourhoods as follows

$$\rho = \rho_{\vec u/\sqrt n} = \frac12\left(1 + (\vec r_0 + \vec u/\sqrt n)\vec\sigma\right), \qquad \sigma = \sigma_{\vec v/\sqrt n} = \frac12\left(1 + (\vec s_0 + \vec v/\sqrt n)\vec\sigma\right). \qquad (16)$$

Let $P_0 := [\pi_0\rho_0 - \pi_1\sigma_0]_+$ be the optimal projection corresponding to the pair $(\rho_0, \sigma_0)$
and note that it can have dimension one, or it can be zero or identity. In the latter case
the optimal measurement is trivial: one can guess the state without measuring by checking
whether the operator $\pi_0\rho_0 - \pi_1\sigma_0$ is positive or negative.
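The projection $P_0$ can be computed directly from an eigendecomposition. The following is a minimal sketch, not the authors' procedure: the function names, the example Bloch vectors and the convention that outcome $P$ means "guess $\rho$" (consistent with (15)) are assumptions made for illustration.

```python
import numpy as np

# Pauli matrices stacked along the first axis.
PAULI = np.array([
    [[0, 1], [1, 0]],
    [[0, -1j], [1j, 0]],
    [[1, 0], [0, -1]],
], dtype=complex)


def bloch_state(r):
    """Qubit density matrix (1 + r.sigma)/2 for a Bloch vector r."""
    r = np.asarray(r, dtype=float)
    return 0.5 * (np.eye(2) + np.einsum("i,ijk->jk", r, PAULI))


def helstrom_projection(pi0, rho, sigma):
    """Projection onto the positive eigenspace of pi0*rho - pi1*sigma."""
    pi1 = 1.0 - pi0
    eigvals, eigvecs = np.linalg.eigh(pi0 * rho - pi1 * sigma)
    positive = eigvecs[:, eigvals > 0]  # may have 0, 1 or 2 columns
    return positive @ positive.conj().T


def bloch_vector(P):
    """Components Tr(P sigma_i); for a rank-one projection this is its Bloch vector."""
    return np.array([np.trace(P @ s).real for s in PAULI])


def error_probability(pi0, rho, sigma, P):
    """Error probability when outcome P means 'guess rho' and 1 - P means 'guess sigma'."""
    pi1 = 1.0 - pi0
    wrong_rho = np.trace(rho @ (np.eye(2) - P)).real
    wrong_sigma = np.trace(sigma @ P).real
    return pi0 * wrong_rho + pi1 * wrong_sigma


# Hypothetical example: equal priors, r0 = (0, 0, 0.8), s0 = (0.6, 0, 0).
rho0 = bloch_state([0.0, 0.0, 0.8])
sigma0 = bloch_state([0.6, 0.0, 0.0])
P0 = helstrom_projection(0.5, rho0, sigma0)
p0 = bloch_vector(P0)
p_err = error_probability(0.5, rho0, sigma0, P0)
```

In this example $\vec d_0 = \pi_0\vec r_0 - \pi_1\vec s_0 = (-0.3, 0, 0.4)$, so the projection has rank one with Bloch vector $\vec d_0/\|\vec d_0\|$.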

Lemma 4.1. Let $(\rho_0, \sigma_0)$ and $(\pi_0, \pi_1)$ satisfy

$$\|\pi_0\vec r_0 - \pi_1\vec s_0\| < |\pi_0 - \pi_1|.$$
Then $P_0$ is either zero or identity and the local minimax excess risk satisfies
$$\inf_{M_n}\ \sup_{\|\vec u\|, \|\vec v\| \le n^\epsilon} \left(P_e(M_n) - P_e^*\right) = O(\exp(-cn))$$

for some $c > 0$.

Proof. Note that the inequality is satisfied only if $\pi_0 \ne \pi_1$ and it implies that

$$\pi_0\rho_0 - \pi_1\sigma_0 = \frac{\pi_0 - \pi_1}{2}\left(1 + \frac{\pi_0\vec r_0 - \pi_1\vec s_0}{\pi_0 - \pi_1}\,\vec\sigma\right)$$

is a positive or negative operator depending on the sign of $\pi_0 - \pi_1$.

Since both eigenvalues of $\pi_0\rho_0 - \pi_1\sigma_0$ are non-zero, there exists a constant $\eta > 0$ such that

$$\|A - (\pi_0\rho_0 - \pi_1\sigma_0)\| \le \eta$$

implies that $A$ is also a positive or negative operator. In fact, when $n$ is large enough all
$\pi_0\rho_{\vec u/\sqrt n} - \pi_1\sigma_{\vec v/\sqrt n}$ with $\|\vec u\|, \|\vec v\| \le n^\epsilon$ have this property for some other constant $\eta$.

Consider a simple measurement on the training set where the states are measured separately
in the three bases of the Pauli matrices and the outcome averages are used to construct
estimators $\rho_{\hat{\vec u}/\sqrt n}$ and $\sigma_{\hat{\vec v}/\sqrt n}$ of the states. Then by basic concentration inequalities we get

$$\mathbb{P}\left(\left\|\pi_0\left(\rho_{\hat{\vec u}/\sqrt n} - \rho_{\vec u/\sqrt n}\right) + \pi_1\left(\sigma_{\hat{\vec v}/\sqrt n} - \sigma_{\vec v/\sqrt n}\right)\right\| > \eta\right) \le \exp(-cn),$$

which means that, up to an exponentially small probability of error, the plug-in estimator
$\hat P := [\pi_0\rho_{\hat{\vec u}/\sqrt n} - \pi_1\sigma_{\hat{\vec v}/\sqrt n}]_+$ is equal to $P^*$, which is zero or identity.

□
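The dichotomy between the trivial case of Lemma 4.1 and the rank-one case can be read off from the two eigenvalues $\frac12\left((\pi_0 - \pi_1) \pm \|\pi_0\vec r_0 - \pi_1\vec s_0\|\right)$ of $\pi_0\rho_0 - \pi_1\sigma_0$. The sketch below illustrates this; the function names and the example values are hypothetical.

```python
import numpy as np


def helstrom_operator_eigenvalues(pi0, r0, s0):
    """Eigenvalues ((pi0 - pi1) -/+ ||d0||)/2 of pi0*rho0 - pi1*sigma0."""
    pi1 = 1.0 - pi0
    d0 = pi0 * np.asarray(r0, dtype=float) - pi1 * np.asarray(s0, dtype=float)
    norm_d0 = np.linalg.norm(d0)
    return (pi0 - pi1 - norm_d0) / 2, (pi0 - pi1 + norm_d0) / 2


def helstrom_rank(pi0, r0, s0):
    """Rank of the positive part: 0 (zero), 1 (one-dimensional) or 2 (identity)."""
    low, high = helstrom_operator_eigenvalues(pi0, r0, s0)
    return int(low > 0) + int(high > 0)


# Hypothetical Bloch vectors; only the prior changes between the three cases.
r0 = [0.0, 0.0, 0.5]
s0 = [0.5, 0.0, 0.0]
rank_biased_to_rho = helstrom_rank(0.9, r0, s0)    # ||d0|| < |pi0 - pi1|, pi0 > pi1
rank_biased_to_sigma = helstrom_rank(0.1, r0, s0)  # ||d0|| < |pi0 - pi1|, pi0 < pi1
rank_balanced = helstrom_rank(0.5, r0, s0)         # ||d0|| > |pi0 - pi1|
```

With strongly unbalanced priors the positive part is the identity or zero, so the optimal guess ignores the measurement outcome; with balanced priors it is a one-dimensional projection.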



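From now on we will work under the assumption that

$$\|\pi_0\vec r_0 - \pi_1\vec s_0\| > |\pi_0 - \pi_1|, \qquad (17)$$

so that $P_0 := [\pi_0\rho_0 - \pi_1\sigma_0]_+$ is a one dimensional projection whose Bloch vector is

$$\vec p_0 = \frac{\vec d_0}{\|\vec d_0\|} = \frac{\pi_0\vec r_0 - \pi_1\vec s_0}{\|\pi_0\vec r_0 - \pi_1\vec s_0\|}.$$

The Helstrom projection $P^*$ for the pair of unknown states $(\rho, \sigma)$ has Bloch vector

$$\vec p^{\,*} = \frac{\vec d}{\|\vec d\|} = \frac{\pi_0\vec r - \pi_1\vec s}{\|\pi_0\vec r - \pi_1\vec s\|} = \frac{\vec d_0 + \vec z/\sqrt n}{\|\vec d_0 + \vec z/\sqrt n\|}, \qquad (18)$$

where $\vec z := \pi_0\vec u - \pi_1\vec v$ is a relative parameter and $\vec d := \vec d_0 + \vec z/\sqrt n$.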



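As discussed before, we can take the estimator $M_n$ to be a projective measurement $M_n :=
(P_n, 1 - P_n)$, so to minimise the risk (15) we aim at producing an estimator $P_n$ which is close
to $P^*$. Since the latter is obtained by rotating $P_0$ with an angle of order $n^{-1/2+\epsilon}$, we can assume
without loss of generality that $P_n$ has a Bloch vector $\vec p_n$ which is a small rotation of $\vec p_0$ so that

$$\vec p_n = \frac{\vec d_0 + \vec z_n/\sqrt n}{\|\vec d_0 + \vec z_n/\sqrt n\|} \qquad (19)$$

with $\vec z_n = O(n^\epsilon)$ a vector in the plane orthogonal to $\vec p_0$.
Expanding (18) and (19) in powers of $n^{-1/2}$ we get

$$\vec p^{\,*} - \vec p_n = \frac{1}{\sqrt n}\left[\frac{\vec z - \vec z_n}{\|\vec d_0\|} - \frac{\vec d_0\,(\vec d_0\cdot(\vec z - \vec z_n))}{\|\vec d_0\|^3}\right]
+ \frac{1}{n}\left[-\frac{\vec z\,(\vec d_0\cdot\vec z) - \vec z_n\,(\vec d_0\cdot\vec z_n)}{\|\vec d_0\|^3} - \frac{\vec d_0\,(\|\vec z\|^2 - \|\vec z_n\|^2)}{2\|\vec d_0\|^3} + \frac{3\,\vec d_0\,((\vec d_0\cdot\vec z)^2 - (\vec d_0\cdot\vec z_n)^2)}{2\|\vec d_0\|^5}\right] + o(n^{-1}).$$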



We now plug these expressions back into (15), taking into account that $\vec z_n$ is perpendicular
to $\vec d_0$, and obtain

$$P_e(M_n) - P_e^* = \mathbb{E}\,\mathrm{Tr}\left((\pi_0\rho - \pi_1\sigma)(P^* - P_n)\right) = \frac12\,\mathbb{E}\,\vec d\cdot(\vec p^{\,*} - \vec p_n) = \frac{1}{4n\|\vec d_0\|}\,\mathbb{E}\|\vec z_\perp - \vec z_n\|^2 + o(n^{-1}),$$

where $\vec z_\perp = \vec z - \vec d_0(\vec z\cdot\vec d_0)/\|\vec d_0\|^2$ is the projection of $\vec z$ onto the plane orthogonal to $\vec d_0$.
It is clear now that the rate of convergence of the excess risk (15) is $n^{-1}$, so it is meaningful
to optimise the quantity $nR_{max}(M_n)$, and the contribution coming from the $o(n^{-1})$ term can
be dropped.
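The quadratic approximation of the excess risk can be compared with the exact expression $\frac12\,\vec d\cdot(\vec p^{\,*} - \vec p_n)$ for fixed (non-random) local vectors. The sketch below does this numerically; the function names and the particular values of $\vec d_0$, $\vec z$, $\vec z_n$ and $n$ are assumptions chosen for illustration.

```python
import numpy as np


def unit(v):
    """Normalise a vector to unit length."""
    return v / np.linalg.norm(v)


def excess_risk_exact(d0, z, zn, n):
    """Exact excess risk (1/2) d.(p* - p_n) with d = d0 + z/sqrt(n)."""
    d = d0 + z / np.sqrt(n)
    p_star = unit(d)                    # Bloch vector of the Helstrom projection
    p_n = unit(d0 + zn / np.sqrt(n))    # Bloch vector of the estimated projection
    return 0.5 * d @ (p_star - p_n)


def excess_risk_quadratic(d0, z, zn, n):
    """Leading-order term ||z_perp - z_n||^2 / (4 n ||d0||)."""
    norm_d0 = np.linalg.norm(d0)
    z_perp = z - d0 * (z @ d0) / norm_d0 ** 2
    return np.sum((z_perp - zn) ** 2) / (4 * n * norm_d0)


# Hypothetical local configuration; zn is chosen orthogonal to d0.
d0 = np.array([0.0, 0.0, 0.4])
z = np.array([0.3, -0.2, 0.5])
zn = np.array([0.1, 0.25, 0.0])
n = 10 ** 6

exact = excess_risk_exact(d0, z, zn, n)
approx = excess_risk_quadratic(d0, z, zn, n)
ratio = exact / approx
```

For large $n$ the ratio of the two quantities approaches one, in line with the $o(n^{-1})$ remainder.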

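Since $M_n$ is uniquely determined by $\vec z_n$ through (19), we define the quadratic loss function for the
measurement on the training set in terms of local variables

$$L((\vec u, \vec v), \vec z_n) := \frac{1}{4\|\vec d_0\|}\,\|\vec z_\perp - \vec z_n\|^2, \qquad \vec z_\perp := (\pi_0\vec u - \pi_1\vec v)_\perp \qquad (20)$$

and the associated renormalised risk is $R_{\vec u,\vec v}(\vec z_n) := \mathbb{E}\,L((\vec u, \vec v), \vec z_n)$. The local maximum risk
(7) around $(\rho_0, \sigma_0)$ is then

$$R_{max}(\vec z_n; \rho_0, \sigma_0) := \sup_{\|\vec u\|, \|\vec v\| \le n^\epsilon} R_{\vec u,\vec v}(\vec z_n) = \sup_{\|\vec u\|, \|\vec v\| \le n^\epsilon} \frac{1}{4\|\vec d_0\|}\,\mathbb{E}\|\vec z_\perp - \vec z_n\|^2. \qquad (21)$$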

In conclusion, we need to find the optimal measurement strategy on the training set with 
respect to the above quadratic form of the local parameters. 

4.2. The training set 

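To solve the above problem we employ the machinery of local asymptotic normality. As
before, let $\rho$ and $\sigma$ be states in local neighbourhoods of $\rho_0$ and respectively $\sigma_0$ described by
(16). We write their local Bloch vectors $(\vec u, \vec v)$ as

$$\vec u = u_1\vec a_1 + u_2\vec a_2 + u_3\vec a_3 \quad\text{and}\quad \vec v = v_1\vec b_1 + v_2\vec b_2 + v_3\vec b_3$$

where $(\vec a_1, \vec a_2, \vec a_3)$ and $(\vec b_1, \vec b_2, \vec b_3)$ are two coordinate systems which satisfy the conditions
(see Figure 4)

(i) $\vec a_3$ is parallel to $\vec r_0$;

(ii) $\vec b_3$ is parallel to $\vec s_0$;

(iii) $\vec a_1$, $\vec b_1$ are in the plane $(\vec r_0, \vec s_0)$;

(iv) $\vec a_2 = \vec b_2$ is perpendicular to the plane $(\vec r_0, \vec s_0)$.

With these notations the local statistical model for the training set is

$$\mathcal T_n := \left\{\rho_{\vec u/\sqrt n}^{\otimes \pi_0 n} \otimes \sigma_{\vec v/\sqrt n}^{\otimes \pi_1 n} : \|\vec u\|, \|\vec v\| \le n^\epsilon\right\}$$

and the corresponding Gaussian shift model is

$$\mathcal G^{(2)} := \left\{N_{\vec u} \otimes N_{\vec v} \otimes \Phi_{\vec u} \otimes \Phi_{\vec v} : \vec u, \vec v \in \mathbb{R}^3\right\} \qquad (22)$$

where

$$N_{\vec u} := N(\sqrt{\pi_0}\,u_3, 1 - r_0^2), \qquad N_{\vec v} := N(\sqrt{\pi_1}\,v_3, 1 - s_0^2),$$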







and $\Phi(q, p, v)$ denotes a displaced thermal equilibrium state with means $(q, p)$ and variance $v$.
The following technical lemma shows that local asymptotic normality can be used to transfer 
the problem of the optimal classification from a training set consisting of qubits, to a Gaussian 
one. The arguments are rather standard though tedious, and since the same method has been 
used for finding the optimal estimation procedure for qubits [24], we refer to that paper for 
the proof. 

Lemma 4.2. Consider the problems of finding asymptotically optimal strategies for the
models $\mathcal T_n$ and respectively $\mathcal G_n$ with respect to the loss function (20). Then the local minimax
risks of both problems converge to the same constant, which is the minimax risk of the
unrestricted Gaussian shift model $\mathcal G^{(2)}$.

In conclusion, the measurement of the training set should be aimed at optimally estimating
the two-parameter vector $\vec z_\perp$ directly, rather than using a 'plug-in' strategy where the three
dimensional local parameters $(\vec u, \vec v)$ are first (optimally) estimated and then the measurement
$P_n$ is constructed as in (8). We will come back to this point later on when the two methods
are compared.

5. Optimal classifier 

In this section we formulate our main result characterising the asymptotically optimal 
measurement on the training set and derive the expression of the optimal excess risk. 
Summarising the previous section, we transformed the original problem into a parameter
estimation one for the Gaussian shift model (22) with parameters $(\vec u, \vec v) \in \mathbb{R}^3 \times \mathbb{R}^3$. The
parameter to be estimated, $\vec z_\perp \in \mathbb{R}^2$, is a linear transformation of $(\vec u, \vec v)$

$$\vec z_\perp = \vec z - \vec d_0(\vec z\cdot\vec d_0)/\|\vec d_0\|^2, \qquad \vec z := \pi_0\vec u - \pi_1\vec v,$$
i.e. we would like to minimise the risk

$$R_{max}(\hat z; \rho_0, \sigma_0) := \sup_{\vec u,\vec v} \mathbb{E}\,L((\vec u, \vec v), \hat z) = \sup_{\vec u,\vec v} \frac{1}{4\|\vec d_0\|}\,\mathbb{E}\|\hat z - \vec z_\perp\|^2.$$

Since the local parameters contain both classical and quantum components it is convenient
to express the loss function $L((\vec u, \vec v), \hat z)$ in terms of these components. Let $(\vec p_0, \vec l_0, \vec k_0)$ be the
reference frame with $\vec l_0$ in the plane $(\vec r_0, \vec s_0)$. Denote by $\varphi_0, \varphi_1$ the angles between $(\vec r_0, \vec l_0)$
and respectively $(\vec s_0, \vec l_0)$ (see Figure 4). Then $\vec z_\perp = z_l\vec l_0 + z_k\vec k_0$ with components

$$z_l = (\pi_0\cos\varphi_0\,u_3 - \pi_1\cos\varphi_1\,v_3) + (\pi_0\sin\varphi_0\,u_1 + \pi_1\sin\varphi_1\,v_1) := z_l^{(c)} + z_l^{(q)},$$
$$z_k = \pi_0u_2 - \pi_1v_2,$$

where $z_l$ was split into a contribution coming from the 'classical' parameters $(u_3, v_3)$, and
another one from the 'quantum' parameters. Since the classical and quantum parts of the
Gaussian model are independent it is easy to verify that the optimal estimator $\hat z$ can be written
as

$$\hat z = (\hat z_l^{(c)} + \hat z_l^{(q)})\,\vec l_0 + \hat z_k\,\vec k_0$$

where $\hat z_l^{(c)}$ is the optimal estimator of $z_l^{(c)}$ and $(\hat z_l^{(q)}, \hat z_k)$ are optimal estimators of $(z_l^{(q)}, z_k)$
obtained by (jointly) measuring the two quantum Gaussian components. The excess risk can
then be written as

$$\frac{1}{4\|\vec d_0\|}\left(\mathbb{E}\left[(z_l^{(c)} - \hat z_l^{(c)})^2\right] + \mathbb{E}\left[(z_l^{(q)} - \hat z_l^{(q)})^2 + (z_k - \hat z_k)^2\right]\right),$$

where the classical and quantum contributions separate and can be optimised separately. The
optimal choice for the classical estimator is



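$$\hat z_l^{(c)} = \sqrt{\pi_0}\cos\varphi_0\,X_r - \sqrt{\pi_1}\cos\varphi_1\,X_s,$$

where $(X_r, X_s) \sim N_{\vec u} \otimes N_{\vec v}$ denote the random variables making up the classical part of the
limit Gaussian model. Its contribution to the excess risk is

$$\mathbb{E}\left[(z_l^{(c)} - \hat z_l^{(c)})^2\right] = \pi_0(1 - r_0^2)\cos^2\varphi_0 + \pi_1(1 - s_0^2)\cos^2\varphi_1. \qquad (24)$$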



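On the other hand $(z_l^{(q)}, z_k)$ are the means of the canonical coordinates

$$Q^{(l)} := \sqrt{2r_0\pi_0}\,\sin\varphi_0\,Q_1 + \sqrt{2s_0\pi_1}\,\sin\varphi_1\,Q_2,$$
$$Q^{(k)} := \sqrt{2r_0\pi_0}\,P_1 - \sqrt{2s_0\pi_1}\,P_2 \qquad (25)$$

whose commutator is

$$[Q^{(l)}, Q^{(k)}] = i(2r_0\pi_0\sin\varphi_0 - 2s_0\pi_1\sin\varphi_1)\mathbf{1} := ic\mathbf{1}.$$

Now, the optimal joint measurement of canonical variables is of the heterodyne type, where the
non-commuting coordinates are combined with the coordinates of an additional oscillator
prepared in a squeezed state [10, 24]. The optimal mean square error is

$$\mathbb{E}\left[(z_l^{(q)} - \hat z_l^{(q)})^2 + (z_k - \hat z_k)^2\right] = \mathrm{Var}(Q^{(l)}) + \mathrm{Var}(Q^{(k)}) + |c|$$
$$= \pi_0\sin^2\varphi_0 + \pi_1\sin^2\varphi_1 + 1 + 2|\pi_0r_0\sin\varphi_0 - \pi_1s_0\sin\varphi_1|. \qquad (26)$$

Adding the classical and quantum contributions (24) and (26), and using the relation
$\pi_0r_0\cos\varphi_0 = \pi_1s_0\cos\varphi_1$ (which holds because $\vec d_0$ is orthogonal to $\vec l_0$), we obtain the minimax risk

$$R^{(l)}_{minmax}(\rho_0, \sigma_0) = \frac{2 + 2|\pi_0r_0\sin\varphi_0 - \pi_1s_0\sin\varphi_1| - r_0s_0\cos\varphi_0\cos\varphi_1}{4\|\vec d_0\|}, \qquad (27)$$

which only depends on the states $(\rho_0, \sigma_0)$, for given priors $(\pi_0, \pi_1)$.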
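The way (24) and (26) combine into the numerator of (27) can be checked numerically. The sketch below is an illustration under stated assumptions: the function names and parameter values are hypothetical, the angles are treated as free inputs, and $\varphi_1$ is then fixed so that the orthogonality relation $\pi_0r_0\cos\varphi_0 = \pi_1s_0\cos\varphi_1$ holds.

```python
import numpy as np


def classical_term(pi0, r0, s0, phi0, phi1):
    """Right-hand side of (24)."""
    pi1 = 1.0 - pi0
    return (pi0 * (1 - r0 ** 2) * np.cos(phi0) ** 2
            + pi1 * (1 - s0 ** 2) * np.cos(phi1) ** 2)


def quantum_term(pi0, r0, s0, phi0, phi1):
    """Right-hand side of (26): Var(Q_l) + Var(Q_k) + |c|."""
    pi1 = 1.0 - pi0
    c = 2 * (pi0 * r0 * np.sin(phi0) - pi1 * s0 * np.sin(phi1))
    return pi0 * np.sin(phi0) ** 2 + pi1 * np.sin(phi1) ** 2 + 1 + abs(c)


def minimax_numerator(pi0, r0, s0, phi0, phi1):
    """Numerator of (27), i.e. 4 ||d0|| times the minimax risk."""
    pi1 = 1.0 - pi0
    c_half = pi0 * r0 * np.sin(phi0) - pi1 * s0 * np.sin(phi1)
    return 2 + 2 * abs(c_half) - r0 * s0 * np.cos(phi0) * np.cos(phi1)


# Hypothetical parameters; phi1 is chosen to satisfy the orthogonality relation.
pi0, r0, s0, phi0 = 0.5, 0.8, 0.6, 1.0
pi1 = 1.0 - pi0
phi1 = np.arccos(pi0 * r0 * np.cos(phi0) / (pi1 * s0))

summed = classical_term(pi0, r0, s0, phi0, phi1) + quantum_term(pi0, r0, s0, phi0, phi1)
simplified = minimax_numerator(pi0, r0, s0, phi0, phi1)
```

The two numbers agree only when the orthogonality relation is imposed; for unconstrained angles the sum of (24) and (26) contains $-\pi_0r_0^2\cos^2\varphi_0 - \pi_1s_0^2\cos^2\varphi_1$ in place of the product term.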

Theorem 5.1. Consider the quantum classification problem with training set $\rho^{\otimes \pi_0 n} \otimes \sigma^{\otimes \pi_1 n}$
where $\rho, \sigma$ are unknown qubit states and $(\pi_0, \pi_1)$ are known.

Let $R^{(l)}_{minmax}(\rho_0, \sigma_0)$ be the local minimax risk as defined in Section 2.3. Under the
assumption (17), $R^{(l)}_{minmax}(\rho_0, \sigma_0)$ is given by (27).
The optimal measurement consists of the following steps:

(i) construct rough estimators of $\rho$ and $\sigma$ by measuring $n^{1-\epsilon}$ systems;

(ii) transfer the localised spins state by $T_n$ as in Theorem 3.5;

(iii) perform the optimal coherent measurement of $(Q^{(l)}, Q^{(k)})$ and combine with the classical
estimator $\hat z_l^{(c)}$ to produce the estimator $P_n$.
Figure 4. Bloch ball geometry of the learning problem.

The unknown states are localised in the two yellow balls centred at $\vec r_0$ and $\vec s_0$ and have local
vectors $\vec u/\sqrt n$ and $\vec v/\sqrt n$ coloured in purple.

The three reference systems $(\vec a_1, \vec a_2, \vec a_3)$, $(\vec b_1, \vec b_2, \vec b_3)$ and $(\vec p_0, \vec l_0, \vec k_0)$ are coloured in red.

The green equatorial plane is orthogonal to $\vec p_0$ and contains the estimator $\vec z_n$ and the vector to
be estimated $\vec z_\perp$ (coloured in purple).



5.1. Plug-in classifier based on optimal state estimation

Here we compute the asymptotics of the renormalised risk of the plug-in classifier based on
optimal state estimation.

The problem of optimal state estimation for mixed i.i.d. qubits was solved in the asymptotic
local minimax setting in [24]. The optimal measurement procedure is adaptive and the first
two steps are identical to those of Theorem 5.1:

(i) construct rough estimators of $\rho$ and $\sigma$ by measuring $n^{1-\epsilon}$ systems;

(ii) transfer the localised spins state by $T_n$ as in Theorem 3.5;

(iii) perform separate heterodyne measurements on the modes $(Q_1, P_1)$ and $(Q_2, P_2)$ and
observe the classical components to obtain the estimators $\tilde u_n$ and $\tilde v_n$.

Once the states (local parameters) have been estimated we can classify new states by applying
the plug-in measurement $\tilde M_n := (\tilde P_n, 1 - \tilde P_n)$ where $\tilde P_n$ has Bloch vector

$$\tilde p_n := \frac{\vec d_0 + \tilde z_n/\sqrt n}{\|\vec d_0 + \tilde z_n/\sqrt n\|}, \qquad \tilde z_n := \tilde z_\perp := (\pi_0\tilde u_n - \pi_1\tilde v_n)_\perp. \qquad (28)$$

Note that $\tilde z_n$ was chosen to be the component of $\tilde z$ orthogonal to the vector $\vec p_0$ rather than
$\tilde z$ itself. However a simple Taylor expansion shows that the two estimators give the same
leading order contribution to the risk. As before, the risk is the expectation of
the quadratic loss function $L((\vec u, \vec v), \tilde z)$ defined in (20), but now with $\tilde z$ having a different
distribution compared with the optimal $\hat z$. Again, we write $\tilde z$ as

$$\tilde z = \tilde z_l\vec l_0 + \tilde z_k\vec k_0 = (\tilde z_l^{(c)} + \tilde z_l^{(q)})\,\vec l_0 + \tilde z_k\vec k_0$$

and the risk is

$$R_{max}(\tilde z; \rho_0, \sigma_0) = \frac{1}{4\|\vec d_0\|}\,\mathbb{E}\left[(z_l^{(c)} - \tilde z_l^{(c)})^2 + (z_l^{(q)} - \tilde z_l^{(q)})^2 + (z_k - \tilde z_k)^2\right].$$

While the contribution from the first term is given by (24), the 'quantum components' have
different variances due to the fact that we used a different heterodyne measurement. By using
(25) and the fact that heterodyne detection adds 1/2 to the variance of each canonical coordinate
we obtain

$$\mathbb{E}\left[(z_l^{(q)} - \tilde z_l^{(q)})^2\right] = \pi_0\sin^2\varphi_0\,(r_0 + 1) + \pi_1\sin^2\varphi_1\,(s_0 + 1),$$
$$\mathbb{E}\left[(z_k - \tilde z_k)^2\right] = \pi_0(r_0 + 1) + \pi_1(s_0 + 1). \qquad (29)$$
Adding the three contributions we get

$$4\|\vec d_0\|\,R_{max}(\tilde z; \rho_0, \sigma_0) = 2 + \pi_0(r_0\sin^2\varphi_0 + r_0 - r_0^2\cos^2\varphi_0) + \pi_1(s_0\sin^2\varphi_1 + s_0 - s_0^2\cos^2\varphi_1). \qquad (30)$$

Theorem 5.2. Consider the quantum classification problem with training set $\rho^{\otimes \pi_0 n} \otimes \sigma^{\otimes \pi_1 n}$
where $\rho, \sigma$ are unknown qubit states and $(\pi_0, \pi_1)$ are known.

Under the assumption (17), the asymptotic renormalised maximum risk $R_{max}(\tilde z; \rho_0, \sigma_0)$ of
the plug-in classifier (28) is given by (30).

Comparing the minimax risk (27) with the risk (30) of the plug-in classifier we get

$$4\|\vec d_0\|\left(R_{max}(\tilde z; \rho_0, \sigma_0) - R^{(l)}_{minmax}(\rho_0, \sigma_0)\right) = \pi_0r_0(1 \pm \sin\varphi_0)^2 + \pi_1s_0(1 \mp \sin\varphi_1)^2,$$

where the upper signs are taken when $\pi_0r_0\sin\varphi_0 - \pi_1s_0\sin\varphi_1$ is negative and the lower signs otherwise. This quantity is
equal to zero if and only if $\sin\varphi_0 = \mp 1$ and $\sin\varphi_1 = \pm 1$, which means that the vectors $\vec r_0$ and
$\vec s_0$ are parallel and point in the same direction. For fixed priors, the difference is maximum
when $\vec r_0$ and $\vec s_0$ point in opposite directions and have length one.

This can be easily understood from the Gaussian model. When the vectors are parallel then
learning requires an optimal joint measurement of non-commuting variables $(Q_1 - Q_2, P_1 - P_2)$
whose risk is the same as that of heterodyning the oscillators first and constructing
linear combinations. In the anti-parallel case we need to measure commuting variables
$(Q_1 + Q_2, P_1 - P_2)$ which can be done directly, without any loss.
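The gap between the plug-in risk (30) and the minimax risk can be verified as an algebraic identity. The sketch below does so for a grid of angles, using the sum of (24) and (26) for the minimax numerator so that no relation between the angles is needed; the function names and parameter values are hypothetical.

```python
import numpy as np


def minimax_numerator(pi0, r0, s0, phi0, phi1):
    """Sum of (24) and (26), i.e. 4 ||d0|| times the minimax risk."""
    pi1 = 1.0 - pi0
    c_half = pi0 * r0 * np.sin(phi0) - pi1 * s0 * np.sin(phi1)
    return (2 + 2 * abs(c_half)
            - pi0 * r0 ** 2 * np.cos(phi0) ** 2
            - pi1 * s0 ** 2 * np.cos(phi1) ** 2)


def plugin_numerator(pi0, r0, s0, phi0, phi1):
    """Right-hand side of (30), i.e. 4 ||d0|| times the plug-in risk."""
    pi1 = 1.0 - pi0
    return (2
            + pi0 * (r0 * np.sin(phi0) ** 2 + r0 - r0 ** 2 * np.cos(phi0) ** 2)
            + pi1 * (s0 * np.sin(phi1) ** 2 + s0 - s0 ** 2 * np.cos(phi1) ** 2))


def gap_closed_form(pi0, r0, s0, phi0, phi1):
    """Closed form of the gap, with signs fixed by the sign of the commutator."""
    pi1 = 1.0 - pi0
    c_half = pi0 * r0 * np.sin(phi0) - pi1 * s0 * np.sin(phi1)
    sign = 1.0 if c_half >= 0 else -1.0
    return (pi0 * r0 * (1 - sign * np.sin(phi0)) ** 2
            + pi1 * s0 * (1 + sign * np.sin(phi1)) ** 2)


# Hypothetical priors and purities; the angles run over a grid.
pi0, r0, s0 = 0.3, 0.7, 0.9
angles = np.linspace(-np.pi, np.pi, 25)
max_mismatch = max(
    abs(plugin_numerator(pi0, r0, s0, a, b)
        - minimax_numerator(pi0, r0, s0, a, b)
        - gap_closed_form(pi0, r0, s0, a, b))
    for a in angles for b in angles
)

# Configuration with sin(phi0) = 1 and sin(phi1) = -1, where the gap vanishes.
gap_parallel = gap_closed_form(pi0, r0, s0, np.pi / 2, -np.pi / 2)
```

The gap is non-negative for all angles, so under these formulas the plug-in classifier is never better than the optimal one.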
5.2. The case of unknown priors

The analysis so far deals with known priors $\pi_0, \pi_1$, which is the standard set-up usually
considered in quantum statistics. In general, the priors may be unknown but can be estimated
from the training set with a standard error of order $n^{-1/2}$. Since the Helstrom measurement depends
also on $(\pi_0, \pi_1)$, this uncertainty will bring an additional contribution to the excess risk. To
find it, one needs to go back to the derivation of the quadratic loss function and add another
unknown local parameter $\delta$ for the prior: $\pi_0 = \theta_0 + \delta/\sqrt n$.
Then (18) becomes

$$\vec p^{\,*} = \frac{\vec d}{\|\vec d\|} = \frac{\vec d_0 + \vec z/\sqrt n + \delta(\vec r_0 + \vec s_0)/\sqrt n}{\|\vec d_0 + \vec z/\sqrt n + \delta(\vec r_0 + \vec s_0)/\sqrt n\|}. \qquad (31)$$

By going through the same steps, we get to the quadratic loss function

$$L((\vec u, \vec v, \delta), \vec z_n) := \frac{1}{4\|\vec d_0\|}\,\|\vec z_\perp + \delta(\vec r_0 + \vec s_0)_\perp - \vec z_n\|^2 \qquad (32)$$

where $(\vec r_0 + \vec s_0)_\perp$ is the component orthogonal to $\vec p_0$.

As before, the training set can be cast into a Gaussian model, with an additional independent
component $Z \sim N(\delta, \pi_0\pi_1)$. This means that when taking the expectation of $L$ we get an
additional term

$$\frac{\mathbb{E}(Z - \delta)^2}{4\|\vec d_0\|}\,\|(\vec r_0 + \vec s_0)_\perp\|^2 = \frac{\pi_0\pi_1\,\|(\vec r_0 + \vec s_0)_\perp\|^2}{4\|\vec d_0\|}.$$
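The extra shift $\delta(\vec r_0 + \vec s_0)/\sqrt n$ in (31) can be traced to a first-order expansion of $\vec d = \pi_0\vec r - \pi_1\vec s$. The following sketch assumes $\pi_1 = 1 - \pi_0$ and that $\vec d_0$ and $\vec z$ are defined with the reference value $\theta_0$ in place of $\pi_0$; these conventions are assumptions made for the illustration rather than statements taken from the text.

```latex
\begin{align*}
\vec d &= \pi_0\vec r - \pi_1\vec s
  = \Big(\theta_0 + \tfrac{\delta}{\sqrt n}\Big)\Big(\vec r_0 + \tfrac{\vec u}{\sqrt n}\Big)
  - \Big(1 - \theta_0 - \tfrac{\delta}{\sqrt n}\Big)\Big(\vec s_0 + \tfrac{\vec v}{\sqrt n}\Big) \\
 &= \underbrace{\theta_0\vec r_0 - (1 - \theta_0)\vec s_0}_{\vec d_0}
  + \frac{1}{\sqrt n}\Big(\underbrace{\theta_0\vec u - (1 - \theta_0)\vec v}_{\vec z}
  + \delta(\vec r_0 + \vec s_0)\Big)
  + \frac{\delta(\vec u + \vec v)}{n}.
\end{align*}
```

The last term is of order $n^{-1+2\epsilon}$ when $\delta$, $\vec u$ and $\vec v$ are of order $n^\epsilon$, so it does not contribute to the leading order of the loss.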







6. Conclusions 



We solved the problem of classifying two qubit states in the asymptotic local minimax 
statistical framework. Asymptotically the problem reduces to that of optimally estimating 
a sub-parameter of a quantum Gaussian model consisting of two independent oscillators in 
displaced thermal states with unknown means. The estimator is then used to construct an 
approximation of the (unknown) Helstrom measurement which is used to classify unlabelled 
states. The optimal procedure has excess risk of order $n^{-1}$ and we computed the exact
constant factor $R^{(l)}_{minmax}(\rho_0, \sigma_0)$ as a function of the two unknown states. Except in the special
case of states with parallel Bloch vectors, the optimal procedure performs strictly better 
than the plug-in classifier obtained by estimating the states and applying the corresponding 
Helstrom measurement. The difference is only a constant factor, but it would probably become 
significant in more interesting infinite dimensional models. 

Finally let us briefly discuss the Bayesian analogue of our result. In the Bayesian framework
one would choose a 'regular' prior $\mu(d\rho \times d\sigma)$ over the two types of states and try to find the
(asymptotically) optimal Bayes risk for this prior

$$R^{opt}_\mu := \limsup_{n\to\infty}\ \inf_{M_n}\ n\int \mu(d\rho \times d\sigma)\,R_{(\rho,\sigma)}(M_n).$$

When the states are pure and the prior is uniform, this has been done (even
non-asymptotically) in [18], but the proof relies on the symmetry of the prior and cannot be applied
to general priors or to mixed states. Based on a similar analysis done for state estimation [42],
we expect that our result can be used to prove that

$$R^{opt}_\mu = \int R^{(l)}_{minmax}(\rho_0, \sigma_0)\,\mu(d\rho_0 \times d\sigma_0).$$

The intuitive explanation is that when $n \to \infty$ the features of the prior $\mu$ are washed out and
the posterior distribution concentrates in a local neighbourhood of the true parameter, where 
the behaviour of the classifiers is governed by the local minimax risk. Proving this relation is 
however beyond the scope of this paper. 